This document describes and demonstrates analysis methods used to analyze mouse hippocampal subregion-specific transcriptome RNA-seq data (Farris et al manuscript in preparation.)

Raw sequence read data is provided through GEO GSE116343, with RNA-seq data available at GEO GSE116342. Data used for analysis in R is provided through an R package farrisdata, installed in R with devtools::install_github("jmw86069/farrisdata").

Description of Input Data

The input data for this analysis workflow is derived from Salmon quantitation files, which were already imported and assembled into relevant data matrices.

Gene expression data

Salmon quantitation files were imported using the R Bioconductor package tximport, summarized to the gene level, and stored in an R object farrisGeneSE, which is available in the R data package farrisdata, installed using devtools::install_github("jmw86069/farrisdata").

## Load the gene data
library(farrisdata);
library(SummarizedExperiment);
farrisGeneSE;
## class: SummarizedExperiment 
## dim: 49341 24 
## metadata(4): design contrasts genes samples
## assays(2): counts raw_counts
## rownames(49341): -343C11.2 00R_AC107638.2 ... n-TSaga9 n-TStga1
## rowData names(4): probeID ProbeName GeneName geneSymbol
## colnames(24): CA2CB492 CA2CB496 ... DGDE496 DGDE502
## colData names(4): CellType Compartment AnimalID groupName

farrisGeneSE SummarizedExperiment format

farrisGeneSE is a SummarizedExperiment object with this content:

  • assays list containing the numeric matrix with gene rows, sample columns, with log2 median-normalized counts. Each row is named using a unique gene symbol derived from the Gencode GTF file. To access the normalized TPM data:
assays(farrisGeneSE)[["counts"]]
  • assays also contains the raw Salmon pseudocounts accessible, using this command: assays(farrisGeneSE)[["raw_counts"]]

  • rowData data.frame with gene rows, using the gene symbol derived from the Gencode GTF file.

as.data.frame(head(rowData(farrisGeneSE), 10)) %>%
   mutate() %>%
   select(probeID, geneSymbol, ends_with("_type"),
      contains("has")) %>%
   kable(escape=FALSE) %>%
   kable_styling()
probeID geneSymbol
-343C11.2 -343C11.2
00R_AC107638.2 00R_AC107638.2
00R_Pgap2 00R_Pgap2
0610005C13Rik 0610005C13Rik
0610006L08Rik 0610006L08Rik
0610007P14Rik 0610007P14Rik
0610009B22Rik 0610009B22Rik
0610009E02Rik 0610009E02Rik
0610009L18Rik 0610009L18Rik
0610009O20Rik 0610009O20Rik
  • colData annotated data.frame with sample rows, describing the mouse hippocampal Region, CellType, Sample, and groupName.
    CellType Compartment AnimalID groupName
    CA2CB492 CA2 CB 492 CA2_CB
    CA2CB496 496 CA2_CB
    CA2CB502 502 CA2_CB
    CA2DE492 DE 492 CA2_DE
    CA2DE496 496 CA2_DE
    CA2DE502 502 CA2_DE
    CA1CB492 CA1 CB 492 CA1_CB
    CA1CB496 496 CA1_CB
    CA1CB502 502 CA1_CB
    CA1DE492 DE 492 CA1_DE
    CA1DE496 496 CA1_DE
    CA1DE502 502 CA1_DE
    CA3CB492 CA3 CB 492 CA3_CB
    CA3CB496 496 CA3_CB
    CA3CB502 502 CA3_CB
    CA3DE492 DE 492 CA3_DE
    CA3DE496 496 CA3_DE
    CA3DE502 502 CA3_DE
    DGCB492 DG CB 492 DG_CB
    DGCB496 496 DG_CB
    DGCB502 502 DG_CB
    DGDE492 DE 492 DG_DE
    DGDE496 496 DG_DE
    DGDE502 502 DG_DE
    • metadata list describing the experimental design and contrasts used for statistical comparisons.
      • design numeric matrix describing the sample design, suitable for use directly by methods from the R Bioconductor limma package.
      • contrasts numeric matrix describing contrasts, suitable for use directly by limma.
      • samples vector of samples used.
      • genes vector of genes used.

Transcript expression data

Salmon quantitation files were imported using tximport as above, maintaining individual isoform abundances, then stored in an R object farrisTxSE, which is available in the R data package farrisdata, installed using devtools::install_github("jmw86069/farrisdata").

## Load the gene data
farrisTxSE;
## class: SummarizedExperiment 
## dim: 122733 24 
## metadata(4): design contrasts genes samples
## assays(4): tpm counts raw_tpm raw_counts
## rownames(122733): ENSMUST00000193812.1 ENSMUST00000082908.1 ...
##   ENSMUST00000189352.1 ENSMUST00000179623.1
## rowData names(10): transcript_id geneSymbol ... TxHasCDS
##   TxDetectedByTPM
## colnames(24): CA1CB492 CA1CB496 ... DGDE496 DGDE502
## colData names(4): CellType Compartment AnimalID groupName

farrisTxSE format

The farrisTxSE object is a named list with content similar to that described above for gene data:

  • assays list containing the numeric matrix with gene rows, sample columns, with log2 median-normalized counts. Each row is named using a unique gene symbol derived from the Gencode GTF file.
  • rowData data.frame with gene rows, using the gene symbol derived from the Gencode GTF file.
tx_df <- subset(as.data.frame(rowData(farrisTxSE)),
   geneSymbol %in% c("Gria1","Shank2","Ntrk2")) %>%
   select(geneSymbol,
      transcript_id,
      ends_with("_type"),
      contains("has"),
      contains("tpm"));
tx_df <- mixedSortDF(tx_df,
   byCols=match(c("geneSymbol","TxDetectedByTPM"),
     colnames(tx_df))*c(1,-1));
colorSubGene <- c(colorSub,
   group2colors(unique(tx_df$geneSymbol),
      cRange=c(20,30),
      lRange=c(80,90)));
tx_df2 <- kable_coloring(tx_df,
      colorSub=colorSubGene,
      row_color_by="geneSymbol",
      verbose=FALSE,
      returnType="kable") %>%
   collapse_rows(columns=2, valign="middle") %>%
   row_spec(0, background="#DDDDDD")

tx_df2;
geneSymbol transcript_id gene_type transcript_type has3UTR TxHas3UTR TxHasExt3UTR GeneHasExt3UTR TxHasCDS TxDetectedByTPM
15377 Gria1 ENSMUST00000036315.15 protein_coding protein_coding TRUE TRUE TRUE TRUE TRUE TRUE
15378 ENSMUST00000094179.10 protein_coding protein_coding TRUE TRUE TRUE TRUE TRUE TRUE
15376 ENSMUST00000125292.1 protein_coding protein_coding TRUE FALSE TRUE TRUE TRUE FALSE
15379 ENSMUST00000151045.2 protein_coding protein_coding TRUE TRUE TRUE TRUE TRUE FALSE
15380 ENSMUST00000173531.1 protein_coding processed_transcript TRUE FALSE FALSE TRUE FALSE FALSE
26395 Ntrk2 ENSMUST00000109838.8 protein_coding protein_coding TRUE TRUE TRUE TRUE TRUE TRUE
26396 ENSMUST00000079828.5 protein_coding protein_coding TRUE TRUE TRUE TRUE TRUE TRUE
100990 Shank2 ENSMUST00000105900.8 protein_coding protein_coding TRUE TRUE TRUE TRUE TRUE TRUE
100991 ENSMUST00000146006.2 protein_coding protein_coding TRUE TRUE TRUE TRUE TRUE TRUE
100986 ENSMUST00000105902.7 protein_coding protein_coding TRUE TRUE TRUE TRUE TRUE FALSE
100988 ENSMUST00000213146.1 protein_coding protein_coding TRUE FALSE TRUE TRUE TRUE FALSE
100992 ENSMUST00000097929.3 protein_coding protein_coding TRUE TRUE TRUE TRUE TRUE FALSE
100994 ENSMUST00000136979.1 protein_coding processed_transcript TRUE FALSE FALSE TRUE FALSE FALSE
  • colData annotated data.frame with sample rows, describing the mouse hippocampal Region, CellType, Sample, and groupName. This data is identical to the sample table shown for farrisGeneSE above.
  • metadata list describing the experimental design and contrasts used for statistical comparisons.
    • design numeric matrix describing the sample design, suitable for use directly by methods from the R Bioconductor limma package.
    • contrasts numeric matrix describing contrasts, suitable for use directly by limma.
    • samples vector of samples used.
    • genes vector of genes used.

Gene expression analysis summary

BGA plot

salmonBga1 <- runBga(exprs=assays(farrisGeneSE)$counts,
   signal="counts",
   normMethod="",
   groups=nameVector(colData(farrisGeneSE)$groupName,
      colnames(farrisGeneSE)),
   iSamples=nameVector(colnames(farrisGeneSE)),
   iGenes=rownames(farrisGeneSE),
   noiseFloor=10,
   emptyRowThreshhold=0.6,
   controlColumn="controlGroup",
   fixNames=FALSE,
   verbose=FALSE);
## ##  (15:33:59) 17Jul2018:  head(bgaInfo$bet$c1): 
##                    CS1    CS2    CS3     CS4   CS5       CS6     CS7
## 00R_AC107638.2 -0.0608  0.268 -0.123 -0.0806 0.161 -0.000882 0.00048
## 00R_Pgap2      -0.0608  0.268 -0.123 -0.0806 0.161 -0.000882 0.00048
## 0610005C13Rik  -0.0608  0.268 -0.123 -0.0806 0.161 -0.000882 0.00048
## 0610006L08Rik  -0.0608  0.268 -0.123 -0.0806 0.161 -0.000882 0.00048
## 0610007P14Rik   0.8251 -0.551 -0.628  0.4871 0.367  1.129838 0.13863
## 0610009B22Rik  -0.0608  0.268 -0.123 -0.0806 0.161 -0.000882 0.00048
salmonBga1ly <- bgaPlotly3d(salmonBga1$bgaInfo,
   axes=c(1,2,3),
   colorSub=colorSub,
   useScaledCoords=FALSE,
   drawVectors="none",
   drawSampleLabels=FALSE,
   superGroups=gsub("_.+", "", salmonBga1$bgaInfo$fac),
   ellipseType="none",
   sceneX=0, sceneY=1, sceneZ=1,
   verbose=FALSE);
salmonBga1ly;

Comparison with CA1 published data by Nakayama, Ainsley, Cajigas

First we calculate group medians using gene expression data, to determine the set of genes detected above an expression threshold of 128 pseudocounts (log2 = 7).

farrisGeneGroupMedians <- rowGroupMeans(assays(farrisGeneSE)[["counts"]],
   groups=colData(farrisGeneSE)$groupName,
   useMedian=TRUE);

# Plot histogram of expression in CA1_CB and CA1_DE
plotPolygonDensity(farrisGeneGroupMedians[,c("CA1_CB","CA1_DE")],
   xlim=c(0,20),
   ylim=c(0,700),
   ablineV=7);
## ##  (15:37:14) 17Jul2018:  Rescale density to histogram1 
## ##  (15:37:14) 17Jul2018:  maxHistY:700

## ##  (15:37:14) 17Jul2018:  Rescale density to histogram1 
## ##  (15:37:14) 17Jul2018:  maxHistY:700
# Detected genes in CA1_CB and CA1_DE
CA1detected <- (farrisGeneGroupMedians[,"CA1_CB"] > 7 &
   farrisGeneGroupMedians[,"CA1_DE"] > 7);
CA1detCBandDE <- rownames(farrisGeneGroupMedians)[CA1detected];
length(CA1detCBandDE);
## [1] 10877

Next we load previously gene lists representing detected CA1 dendrite genes from published studies from Nakayama et al, Ainsley et al, and Cajigas et al. These genes are available in the farrisdata package, as described above.

Produce a 4-way Venn diagram showing the overlapping genes from these studies.

GeneListL <- list(
  Farris=CA1detCBandDE,
  Cajigas=CajigasGenes,
  Ainsley=AinsleyGenes,
  Nakayama=NakayamaGenes);
GeneListIM <- list2im(GeneListL);
vps <- limma::vennDiagram(GeneListIM,
  circle.col=rainbowJam(4));

Correlation Centered by Compartment, refers to Figure 2

Cell Body

Per-sample correlation, centered across all CB samples. First, we use Salmon normalized pseudocounts, restricted to genes where at least one group mean value is above log2(7), which is >= 128 normalized pseudocounts.

Then data is centered per Compartment, so that CellBody samples are centered by subtracting the mean CellBody expression per gene, and Dendrite samples are centered by subtracting the mean Dendrite expression per gene. This step uses centerGeneData() with the argument centerGroups.

Next we prepare a data.frame with color coding to show the CellType and Compartment values.

## farrisGeneGroupMedians
## Pull out only cell body samples
iSamples <- colnames(farrisGeneSE);
iSamplesCB <- vigrep("CB", iSamples);
iSamplesDE <- vigrep("DE", iSamples);
iSamplesGrp <- colnames(iMatrix7grp);
iSamplesGrpCB <- vigrep("CB", iSamplesGrp);
iSamplesGrpDE <- vigrep("DE", iSamplesGrp);

corrCutoff <- 7;
genesAboveCutoff <- (rowMaxs(farrisGeneGroupMedians) >= corrCutoff);
genesAboveCutoffCB <- (rowMaxs(farrisGeneGroupMedians[,iSamplesGrpCB]) >= corrCutoff);
genesAboveCutoffDE <- (rowMaxs(farrisGeneGroupMedians[,iSamplesGrpDE]) >= corrCutoff);
genesAboveCutoffBoth <- (genesAboveCutoffCB & genesAboveCutoffDE);

iMatrix7 <- assays(farrisGeneSE[genesAboveCutoff,])[["counts"]];
centerGroups <- gsub("^.+(DE|CB).+$",
   "\\1",
   iSamples);
iMatrix7ctr <- centerGeneData(iMatrix7,
   centerGroups=centerGroups);
iMatrix7ctrCor <- cor(iMatrix7ctr);


## Generate some color bars to annotate the heatmap
colDataColorsRepL <- as.data.frame(colData(farrisGeneSE)[,c("CellType","Compartment")]);
colDataColorsRep <- df2groupColors(colDataColorsRepL,
  colorSub=colorSub);

We use heatmap.3() to draw the heatmap.

par("family"="Arial", "lend"="square", "ljoin"="mitre");
heatmap.3(iMatrix7ctrCor[iSamplesCB,iSamplesCB],
   margins=c(15,15),
   trace="none",
   col=colorRampPalette(rev(brewer.pal(15,"RdBu")))(51),
   add.expr=box(),
   cexCol=1.5,
   cexRow=1.5,
   keyLabel="Correlation",
   colLensFactor=10,
   distMethod="euclidean",
   hclustMethod="ward",
   doColorDendrograms=TRUE,
   ColSideColors=t(colDataColorsRep[iSamplesCB,2:1]),
   ColSideColorsNote=t(colDataColorsRepL[iSamplesCB,2:1]),
   RowSideColors=(colDataColorsRep[iSamplesCB,2:1]),
   RowSideColorsNote=(colDataColorsRepL[iSamplesCB,2:1]) );

Cell Body and Dendrite

Correlation showing CA2_CB correlates highest with CA2_DE, etc. for each CellType.

## farrisGeneGroupMedians[genesAboveCutoff,]
iMatrix7grp <- farrisGeneGroupMedians[genesAboveCutoffCB,];
centerGroupsGrp <- gsub("^.*(DE|CB).*$",
   "\\1",
   iSamplesGrp);
iMatrix7grpCtr <- centerGeneData(iMatrix7grp,
   centerGroups=centerGroupsGrp);
iMatrix7grpCtrCor <- cor(iMatrix7grpCtr);

## Generate some color bars to annotate the heatmap
colDataColorsGrpL <- data.frame(rbindList(
   strsplit(nameVector(iSamplesGrp), "_")));
colnames(colDataColorsGrpL) <- c("CellType","Compartment");
colDataColorsGrp <- df2groupColors(colDataColorsGrpL,
  colorSub=colorSub);

We use heatmap.3() to draw the heatmap.

par("family"="Arial", "lend"="square", "ljoin"="mitre");
heatmap.3(iMatrix7grpCtrCor[iSamplesGrp,iSamplesGrp],
   margins=c(15,15),
   trace="none",
   hclCutoff=70,
   cexColSideColorsNote=1.2,
   cexRowSideColorsNote=1.2,
   col=colorRampPalette(rev(brewer.pal(15,"RdBu")))(51),
   add.expr=box(),
   cexCol=1.5,
   cexRow=1.5,
   keyLabel="Correlation",
   colLensFactor=10,
   distMethod="euclidean",
   hclustMethod="ward",
   ColSideColors=t(colDataColorsGrp[iSamplesGrp,2:1]),
   ColSideColorsNote=t(colDataColorsGrpL[iSamplesGrp,2:1]),
   RowSideColors=(colDataColorsGrp[iSamplesGrp,2:1]),
   RowSideColorsNote=(colDataColorsGrpL[iSamplesGrp,2:1]) );

Splom plot, refers to Figure 2, and Supplemental Figure 3

Scatterplot showing the +1,+1 selection of genes, which helps define the gene lists in the next section.

corCols <- c("CA2_DE","CA2_CB");
cutCB <- log2(1.5);
cutDE <- log2(1.5);
splomL <- lapply(nameVector(c("CA1","CA2","CA3","DG")), function(k) {
   corCols <- vigrep(k, colnames(iMatrix7grpCtr));
   df1 <- as.data.frame(iMatrix7grpCtr[,corCols]);
   CBhit <- (abs(df1[,1]) > cutCB) * sign(df1[,1]);
   DEhit <- (abs(df1[,2]) > cutCB) * sign(df1[,2]);
   CBhitN <- paste0(k, "_", "CBhit");
   DEhitN <- paste0(k, "_", "DEhit");
   df1[,CBhitN] <- CBhit;
   df1[,DEhitN] <- DEhit;
   df1[,"GeneName"] <- rownames(df1);
   df1;
});
splomLsub <- lapply(splomL, function(iDF){
   subset(iDF, !iDF[,3] %in% 0 | !iDF[,4] %in% 0)
});

splomDF2 <- rbindList(lapply(splomLsub, function(iDF){
   iDF[,"CellType"] <- gsub("_.+", "", colnames(iDF)[1]);
   colnames(iDF) <- c("CB","DE","CBhit","DEhit","GeneName","CellType");
   iDF;
}));
#splomDF2[,"row"] <- ifelse(splomDF2$CellType %in% c("CA1","CA2"), "CA1", "CA3");
splomDF2[,"CBDEhit"] <- paste0(splomDF2$CBhit,":",splomDF2$DEhit);

## Color-code the splom points
ccColors <- nameVector(rep("grey", 8), unique(splomDF2$CBDEhit));
ccColors[c("1:1","-1:-1")] <- "orangered3";
ccColors[c("-1:1","1:-1")] <- "grey";

## Gene labels
GeneHighlight <- c("Plch2","Rgs14","Necab2","1700024P16Rik");
splomDF2$label <- ifelse(splomDF2$GeneName %in% GeneHighlight,
  splomDF2$GeneName, "");

gg2 <- ggplot(splomDF2,
      aes(x=CB, y=DE, color=CBDEhit, fill=CBDEhit, GeneName=GeneName)) +
   geom_point(shape=21, size=2) +
   geom_vline(xintercept=c(-1,1)*0.585,
      linetype="dashed", color="grey20") +
   geom_hline(yintercept=c(-1,1)*0.585, linetype="dashed", color="grey20") +
   facet_wrap(facets=~CellType) +
   theme_jam() +
   scale_fill_manual(values=ccColors) +
   scale_color_manual(values=makeColorDarker(ccColors)) +
   ggrepel::geom_label_repel(aes(label=label),
      force=8,
      fill="white") +
   theme(legend.position="none");
gg2;

Microarray correlation heatmap.

Neuronal Gene Lists

(Description of the four gene/transcript lists)

Heatmap per-gene.

Differential gene expression using limma

Transcript analysis summary

Definition of “detected transcripts”

We refer to the function defineDetectedTx() in the jampack R package. The function takes normalized counts, normalized TPM values, sample group information, and several cutoff values:

  • cutoffTxPctMax=10 requires a transcript isoform to be at least 10% of the highest isoform present in the same sample.
  • cutoffTxExpr=32 requires a transcript isoform to have at least 32 normalized counts.
  • cutoffTxTPMExpr=2 requires a transcript isoform to have at least a TPM value of 2.
  • All of the above criteria must be met for an isoform to be considered “detected.”
refreshFunctions("farrisSalmonWWS");
## ##  (17:41:10) 19Jun2018:  sourceFunctions(): Loading into environment name: farrisSalmonWWS_functions_funcEnv 
## ##  (17:41:10) 19Jun2018:     Overwriting existing environment name in .GlobalEnv: farrisSalmonWWS_functions_funcEnv 
## ##  (17:41:10) 19Jun2018:  Detatching position:4 
## ##  (17:41:10) 19Jun2018:  Refreshed functions from: farrisSalmonWWS_functions.R
detectedTxTPML <- defineDetectedTx(
   iMatrixTx=assays(farrisSalmonTxSE)[["counts"]],
   iMatrixTxTPM=assays(farrisSalmonTxSE)[["tpm"]],
   groups=colData(farrisSalmonTxSE)$groupName,
   cutoffTxPctMax=10,
   cutoffTxExpr=32,
   cutoffTxTPMExpr=2,
   tx2geneDF=renameColumn(rowData(farrisSalmonTxSE),
     from=c("geneSymbol","probeID"),
     to=c("gene_name","transcript_id")),
   useMedian=FALSE,
   verbose=FALSE);

The code above defined 27,756 detected transcripts by the given criteria.

Transcript type per gene list

e.g. protein_coding, etc.

Analysis of 3’UTR length

We imported Gencode vM12 comprehensive GTF into R using R Bioconductor GenomicFeatures package, with which we derived 3’UTR regions, using makeTxDbFromGFF() and threeUTRsByTranscript(), respectively.

We plotted 3’UTR lengths as violin plots using only detected transcripts based upon TPM criteria described elsewhere.

Translational Efficiency using CAI

Proximal-Distal 3’UTR Analysis

Differential transcript isoform analysis limma

Splicing, Sashimi plots

Gene Set Enrichment, MultiEnrichMap

Selection of gene hits

Used DESeq-normalized expression values.

Hypergeometric enrichment using MSigDB v6.0

Preparing MSigDB v6.0 data

Creating gmtT format, refreshing gene symbols in the gmtT data, and the RNAseq gene hit data to be fully in sync.

Commentary on alt gene symbols, conversion to official mouse Entrez gene symbols.

enrichSimple

Wrapper around standard hypergeometric enrichment tests, with a custom background of detected genes.

MultiEnrichMap

Subset pathways for top 15 per source-category

Heatmap of enrichment P-values.

Cnet plot